Skip to content

Conversation

@siddharthteotia
Copy link
Contributor

@siddharthteotia siddharthteotia commented Sep 12, 2017

cc @jacques-n , @StevenMPhillips

Problem description:

We recently encountered a problem when we were trying to add JSON files with complex schema as datasets.

Initially we started with a Float8Vector with default memory allocation of (4096 * 8) 32KB.
Went through several iterations of setSafe() to trigger a realloc() from 32KB to 64KB.

Another round of setSafe() calls to trigger a realloc() from 64KB to 128KB

After that we encountered a BigInt and promoted our vector to UnionVector. This required us to create a UnionVector with BigIntVector and Float8Vector. The latter required us to transfer the Float8Vector we were earlier working with to the Float8Vector inside the Union.

As part of transferTo(), the target Float8Vector got all the ArrowBuf state (capacity, buffer contents) etc transferred from the source vector.

Later, a realloc was triggered on the Float8Vector inside the UnionVector.

The computation inside realloc() to determine the amount of memory to be reallocated goes wrong since it makes the decision based on allocateSizeInBytes – although this vector was created as part of transfer() from 128KB source vector, allocateSizeInBytes is still at the initial/default value of 32KB

We end up allocating a 64KB buffer and attempt to copy 128KB over 64KB and seg fault when invoking setBytes() whereas our assumption was we will be reallocating to 256KB.

There is a wrong assumption in realloc() that allocateSizeInBytes is always equal to data.capacity(). The particular scenario described above exposes where this assumption could go wrong.

@wesm
Copy link
Member

wesm commented Sep 12, 2017

Since we have a 0.7.0 incoming shortly, I want to make sure that this is not a blocker. We can always release 0.7.1 with such bug fixes shortly thereafter.

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change makes sense. We should make sure the allocation is a power of two.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should make sure this is a power of 2.

@siddharthteotia
Copy link
Contributor Author

addressed review comments

@siddharthteotia
Copy link
Contributor Author

Added unit tests.

@siddharthteotia
Copy link
Contributor Author

Without the code-changes, the unit test fails with segmentation fault (as described in the problem above)

Tests run: 21, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.209 s <<< FAILURE! - in org.apache.arrow.vector.TestValueVector
testReallocAfterVectorTransfer(org.apache.arrow.vector.TestValueVector) Time elapsed: 0.045 s ERROR!
java.lang.IndexOutOfBoundsException: index: 0, length: 131072 (expected: range(0, 65536))
org.apache.arrow.vector.TestValueVector.testReallocAfterVectorTransfer(TestValueVector.java:962)

@siddharthteotia siddharthteotia changed the title [JAVA] ARROW-1553: initial code changes without tests [JAVA] ARROW-1553: realloc should consider the existing buffer capacity for computing target memory requirement Sep 14, 2017
@siddharthteotia siddharthteotia changed the title [JAVA] ARROW-1553: realloc should consider the existing buffer capacity for computing target memory requirement ARROW-1553: [JAVA] realloc should consider the existing buffer capacity for computing target memory requirement Sep 14, 2017
*/

@Test
public void testReallocAfterVectorTransfer() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests for a NullableBitVector and NullableVarCharVector? Does MapVector or ListVector also need a test?

baseSize = (long)currentBufferCapacity;
}
long newAllocationSize = baseSize * 2L;
newAllocationSize = Long.highestOneBit(newAllocationSize - 1) << 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use BaseAllocator.nextPowerOfTwo()

@siddharthteotia siddharthteotia changed the title ARROW-1553: [JAVA] realloc should consider the existing buffer capacity for computing target memory requirement ARROW-1533: [JAVA] realloc should consider the existing buffer capacity for computing target memory requirement Sep 14, 2017
@siddharthteotia
Copy link
Contributor Author

@jacques-n , added tests for fixed-width vector, variable width vector and bit vector. To further test correctness, I verified that without the code changes, unit tests fail with expected segmentation fault in reAlloc().

At this point, I don't think we need tests for other vector types for the problem this patch is fixing.

However, I have two test related patches in pipeline

(1) tests for mapvector -- we don't have any unit tests for MapVector
(2) splitAndTransfer for all vector types -- we recently added for bit vector, union vector, list vector. Need to cover other types too.

I will ensure that those patches cover tests for this scenario as well.

@siddharthteotia
Copy link
Contributor Author

The unit tests added here cover both nullable and non-nullable vectors.

@siddharthteotia siddharthteotia deleted the ARROW-1553 branch September 18, 2017 22:22
asfgit pushed a commit that referenced this pull request Sep 19, 2017
…ty for computing target memory requirement

cc @jacques-n , This is same as #1097

The latter one was closed as I had to rename the branch correctly and use the correct JIRA number.

Author: siddharth <siddharth@dremio.com>

Closes #1112 from siddharthteotia/ARROW-1533 and squashes the following commits:

4c97be4 [siddharth] ARROW-1533: realloc should consider the existing buffer capacity for computing target memory requirement
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…ty for computing target memory requirement

cc @jacques-n , This is same as apache#1097

The latter one was closed as I had to rename the branch correctly and use the correct JIRA number.

Author: siddharth <siddharth@dremio.com>

Closes apache#1112 from siddharthteotia/ARROW-1533 and squashes the following commits:

4c97be4 [siddharth] ARROW-1533: realloc should consider the existing buffer capacity for computing target memory requirement
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants